Summary

When asked to assess the probability that each of their answers
to a set of questions is correct, people typically exhibit
overconfidence: the proportion of correct answers is too small for
the probability values expressed. The present study attempted to
improve the appropriateness of confidence judgments by having
people sort their responses to a group of general knowledge items
into a number of piles, each reflecting a different level of
confidence in their answers. However, this procedure had no
consistent effect on overconfidence, even though it differed in
many ways from previous unsuccessful efforts to reduce this bias.
Implications for future studies of the overconfidence phenomenon
are discussed.

Categorical Confidence

In a typical probability assessment task, participants first
ponder some question of fact and then assess the likelihood that
the answers they have produced are correct. Casual observation of
such individuals suggests that they spend considerably more time
on the first of these operations than on the second. A variety of
possible reasons spring to mind: (a) answers are harder to produce
than probabilities and therefore require more time; (b) we are
more experienced in answering questions and hence can spend more
time profitably on that task; (c) until an answer is produced, one
cannot even begin to assess its accuracy; or (d) we are more
accustomed to having our answers evaluated than our probabilities
and want to take greater care that the former are in order.

Given these reasons for deemphasizing the probability assessment
task, it should perhaps come as no surprise that its quality is
often poor. The most commonly observed result is that the
magnitude of probability assessments is only roughly predictive of
the actual likelihood that the associated answers will be correct.
In most cases, correctness does increase as confidence increases.
However, it increases too slowly. In many tasks, as people's
assessed probabilities of being correct increase from .5 to 1.0,
their actual probability of being correct increases from .5 to
only about .8. People believe that they can distinguish between a
greater range of states of knowledge than is actually the case.

When tasks are difficult, a contrast between people's overall
confidence and their overall accuracy reveals overconfidence; they
make too many high confidence assessments. With easy tasks, one
finds underconfidence. These patterns are very robust; they can be
found with a variety of response modes, question topics, and
levels of expertise (for reviews, see Fischhoff, 1982;
Lichtenstein, Fischhoff & Phillips, 1982). People have, moreover,
considerable confidence in these confidence assessments (e.g.,
Fischhoff, Slovic & Lichtenstein, 1977).

The few experimental manipulations that have managed to improve
the appropriateness of confidence assessments have typically
involved focusing people's attention on the assessment task in a
fairly directive manner. For example, the quality of assessment
improves when assessors are given extensive personalised feedback
(e.g., Lichtenstein & Fischhoff, 1980; Murphy & Winkler, 1977).
Another effective manipulation has been requiring people to list
explicitly reasons supporting and contradicting their choice of
answer, prior to assessing the likelihood that it is correct
(Koriat, Lichtenstein & Fischhoff, 1980).

The secret of even these partial successes is, however, still
unclear. It would be theoretically interesting and practically
useful if such simple manipulations were able to enhance people's
ability to appraise their own knowledge. However, the improvement
observed with these manipulations might come not from helping
people focus on the assessment task, but from some unintended cues
as to how subjects should change their assessments. Because one
does not ordinarily list contradicting reasons, the requirement to
do so might be interpreted by some subjects as a hint to reduce
their confidence. Feedback shows what assessments one should have
used; it may be tempting just to reduce one's probability
assessments mechanically.

An obvious danger with such directive procedures is that whatever
is learned may prove to be task specific, leaving one no better
(or even more poorly) prepared to face a new task differing, say,
in difficulty level. Learning that one is overconfident on a hard
task might, in fact, induce exaggerated underconfidence on a
subsequent easy task. These fears are alleviated somewhat by
Lichtenstein and Fischhoff's (1980) finding of modest
generalization of training to some other tasks. Nonetheless, it
would be comforting to know that confidence assessment could be
improved by a technique that affected response usage only as a
by-product of affecting understanding of how much one knows.

One simple, non-directive way to focus attention on the assessment
task would be to provide people with a detailed lecture on the
nature of the response mode, the properties of good assessments,
and the kinds of biases that may be observed. Such instruction
would prepare people for assessment in general, not just for one
particular task. Unfortunately, however, it does not seem to work
(Lichtenstein & Fischhoff, Note 1).

The present experiment explores an alternative non-directive
approach that differs in many ways from its predecessors. In it,
judges first answer an entire set of two-alternative forced-choice
questions. Then they sort the questions into a prescribed number
of piles, each reflecting a different degree of confidence that
the answers chosen for the items assigned to it are correct.
Finally, after reviewing the results of the sorting procedure,
judges assign a number to each pile expressing the probability
that each item in the pile is correct. This procedure should
emphasize confidence assessment over question answering. Moreover,
within the assessment task it should focus attention on appraising
one's feeling of knowing more than on the production of some
numerical expression of that feeling that the experimenter will
find acceptable. Some explicit response is, of course, needed to
communicate one's degree of confidence, but the careful
formulation of a feeling of knowing should take precedence over
the more technical task of translating it into a number.

One respect in which the present procedure is directive is in its
specification of the number of categories that subjects are to
use. That number might be reasonably interpreted by subjects as an
indication of how many distinct categories they can reliably use.
There is probably no way to avoid giving some direction on this
topic. For example, the non-categorical half-range probability
scale [.5, 1.0] used in many studies seems to suggest to subjects
that they can and should use all the "round" responses (.5, .6,
.7, .8, .9, 1.0). One might even attribute the hypersensitivity
observed in such studies to this implicit suggestion that they are
able to make the discriminations corresponding to these six
distinct levels of knowledge.

A final feature of this procedure that might have a salutary
effect is that it forces subjects to read the entire set of
questions before assessing their confidence in any. Upon entering
an experiment, subjects may have some expectation regarding how
difficult the questions will be. If that expectation is erroneous,
it might artificially buoy or depress their confidence levels
until they had completed enough questions to realize that their
assumption was in error.


Experiment  1 

Method 

Design. The experimental design involved four groups of subjects,
each asked to sort 50 two-alternative questions into a
prespecified number of piles (3, 4, 5, or 6) according to their
degree of confidence in knowing the correct answer to each. After
the sorting was completed, they assessed the probability that each
answer in each pile was correct.

Procedure.  The  details  may  be  best  understood  by  verbatim 
citation  of  relevant  portions  of  the  experimental  instructions: 

For  this  task,  we  have  prepared  50  general-knowledge 
items.  Each  item  has  two  alternative  answers,  one  of 
which  is  correct  and  one  incorrect.  Each  item  appears 
on  a  card.  Your  job  is  to: 

Step 1: Separate the 50 cards, tearing them along the dotted
lines (there are six (6) cards on each page).

Step 2: Go through the cards and circle the letter a or the
letter b to indicate which of the alternatives you think is the
correct alternative. If you have no idea which alternative is
correct, circle one of the two letters anyway; just guess.

Step 3: Sort the cards into 3 [or 4, 5, or 6] piles according to
how sure you are that you have circled the correct alternative.

* One pile should contain all the cards for which you feel least
confident;

* One pile should contain all the cards for which you feel most
confident;

* The other pile[s] will have cards for which you have an
intermediate feeling[s] of sureness.

Keep  sorting  and  resorting  until  all  the  cards  in  a 
particular  pile  are  those  for  which  you  feel  the  same 
level  of  certainty  or  uncertainty. 

You  may,  if  you  wish,  do  steps  2  and  3  at  the  same 
time.  That  is,  you  could  take  the  first  card,  circle 
an  answer,  and  immediately  use  that  card  to  start  one 
pile.  Then  take  the  second  card,  mark  an  answer  on  it, 
and  then  put  it  in  a  pile.  And  so  on. 

Do  not  hesitate  to  rearrange  the  cards,  moving  them 
from  pile  to  pile  as  needed. 

Step 4: When you are satisfied with your sorting, you must assign
a number to each pile. This number expresses the probability, for
each card in that pile, that you have indeed circled the correct
alternative. This number expresses numerically the degree of
certainty or uncertainty that you feel about each of the cards in
the pile.

The number you assign to each pile may be any number from .5 to
1.0. ".5" means that, for each card in the pile, you felt
completely uncertain as to which of the two answers is the correct
answer. The number ".6" means that for each card in the pile, you
felt 60% sure that you selected the correct answer and so forth.
The number "1.0" means that you are completely sure that you have
selected the correct answer for every card in the pile.

*  All  the  cards  in  one  pile  must  be  assigned  the 
same  probability. 

* Every pile must have a different probability.

* You must use numbers from .5 to 1.0 inclusive, but you may pick
any numbers from that range that seem appropriate. You do not have
to use the numbers 1.0 and .5, but you may if they adequately
express your degree of certainty/uncertainty for your most extreme
piles, the ones you feel least and most confident about.

* You may use two-digit numbers (like .55 or .75) if you wish.

* Do not use numbers like .4 or 1.2 that are outside the range .5
to 1.0.

Steps 5 and 6 told subjects how to write their responses,
reemphasizing several key points and informing them that they
would have 40 minutes to complete the task. In studies using the
usual numerical response format, answering 50 questions typically
consumes about 15 minutes, once instructions have been completed.

Items. In order to facilitate comparisons between these responses
and those produced by the usual numerical response format, an item
set was used that had been tested previously on subjects drawn
from the same pool. Specifically, it was the "complete test/hard
items" set, reported in Figure 9 by Lichtenstein and Fischhoff
(1977). Subjects there knew the answers to 61.7% of the items, and
responded with a mean probability of .758, reflecting substantial
overconfidence.

Subjects. One hundred seventy-five individuals participated,
distributed over the four experimental groups according to their
preference for the time at which the different groups were
conducted.

This task was the first of several unrelated tasks presented in
sessions lasting approximately 1½ hours. Subjects were paid $6 and
were recruited through an advertisement in the University of
Oregon student newspaper.

Results 


Response usage. When the original group of subjects (Lichtenstein
& Fischhoff, 1977) responded to these items, the great majority
(35 of 48) used six response categories. Moreover, all but one of
these individuals used the six "natural" responses (.5, .6, .7,
.8, .9, 1.0). In the entire group, all but two subjects used .5,
indicating "guess"; all but one used 1.0, indicating complete
confidence. The bottom rows of Table 1 describe the responses of
these subjects, both for the entire group and for those who used
just six response categories. The first columns are devoted to
response usage.

The top four lines of Table 1 show how the subjects in the present
experiment coped with the constraint of not being able to make all
possible responses. For those who sorted into six piles, this
should have been a minimal constraint. Indeed, most did avail
themselves of the .5 and 1.0 options. Nonetheless, the constraint
did have some effect, in that 22 of the 45 six-pile subjects did
not use the six "natural" responses, preferring other intermediate
values between .5 and 1.0. The subjects who were allowed five
categories typically chose to give up one of the intermediate
responses, rather than one of the extreme responses, each of which
was still used by 92.1% of the subjects. The increasing
constraints on the four-pile and three-pile groups led to reduced
usage of 1.0, but not of .5. That is, "guess" proved to be a more
essential response than "certain." When subjects in the five- and
six-pile groups (and in the original study) failed to use 1.0,
their highest response was always in the .90-.99 range. A number
of the subjects in the three- and four-pile groups had highest
responses less than .9.

Performance. Given these differences in response usage, there is
some reason to expect differences in performance. Figure 1 and the
remainder of Table 1 provide pertinent details. The calibration
curves in Figure 1 show the percentage of correct responses
associated with the mean confidence for each level of confidence
expressed by subjects (after collapsing those expressions into the
categories .5-.59, .6-.69, .7-.79, .8-.89, .9-.99, and 1.0). The
similarities between these curves are more striking than are any
differences. The curves for the various sort-and-label groups
closely resemble one another; perhaps more importantly, they also
resemble the curve for the unconstrained group from Lichtenstein
and Fischhoff (1977). If the four sort-and-label groups are
pooled, the resulting curve falls very close to the unconstrained
group's curve. Sorting per se seems to have had no effect.
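
The collapsing step behind curves like those in Figure 1 can be
sketched in a few lines of Python. This is only an illustrative
sketch, not the authors' analysis code: the function names
`category` and `calibration_curve` and the sample responses are
hypothetical, and the code assumes each response is recorded as a
stated probability (two digits at most) paired with whether the
chosen answer was correct.

```python
from collections import defaultdict


def category(p):
    """Return the lower bound (.5, .6, ..., 1.0) of the category holding p."""
    if not 0.5 <= p <= 1.0:
        raise ValueError("responses must lie in the range .5 to 1.0")
    # Work in integer hundredths so that .59 falls in .5-.59 and 1.0 stands alone.
    hundredths = round(p * 100)
    return (hundredths // 10) / 10


def calibration_curve(responses):
    """Collapse (probability, correct) pairs into one point per category.

    Returns a list of (category, mean_confidence, proportion_correct, n),
    ordered from the lowest category to the highest.
    """
    bins = defaultdict(list)
    for p, correct in responses:
        bins[category(p)].append((p, correct))

    curve = []
    for cat in sorted(bins):
        items = bins[cat]
        n = len(items)
        mean_confidence = sum(p for p, _ in items) / n
        proportion_correct = sum(1 for _, c in items if c) / n
        curve.append((cat, mean_confidence, proportion_correct, n))
    return curve


# Hypothetical responses, invented purely for illustration.
sample = [(0.5, True), (0.55, False), (0.9, True), (1.0, True), (1.0, False)]
for point in calibration_curve(sample):
    print(point)
```

Plotting proportion correct against mean confidence for each
category gives one calibration curve; perfect calibration would
place every point on the identity line.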

This conclusion is generally borne out by the summary statistics
of Table 1. The proportion correct column suggests that the focus
on probability assessment may have slightly reduced the attention
subjects paid to question answering; the mean for all sort groups
was .595, compared with .617 for the unconstrained group. Their
mean confidence was correspondingly lower (.717 vs. .737). As a
result, the sort and non-sort groups have similarly high levels of
overconfidence, which is computed as the signed difference between
mean confidence and proportion correct. The various groups were
overconfident by .11 to .14 on the average.

"Calibration" is a statistic characterizing curves such as those
in Figure 1. It is the mean squared distance between each point in
a curve and the identity line representing perfect calibration,
weighted by the number of responses summarized in each point.
Ideally, it should be 0. These levels, too, are similar in the
sort groups and unconstrained group, confirming the visual
impression from the figure.
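
Read as a formula, the statistic described above is the
response-weighted mean of squared differences between mean
confidence and proportion correct. A minimal sketch follows,
assuming the distance to the identity line is measured vertically;
the function name `calibration_score` and the example points are
hypothetical and are not values from Table 1.

```python
def calibration_score(points):
    """Weighted mean squared gap between confidence and accuracy.

    points: iterable of (mean_confidence, proportion_correct, n_responses),
    one entry per point on a calibration curve. A score of 0 means every
    point lies on the identity line.
    """
    points = list(points)
    total = sum(n for _, _, n in points)
    if total == 0:
        raise ValueError("at least one response is required")
    weighted = sum(n * (conf - prop) ** 2 for conf, prop, n in points)
    return weighted / total


# Hypothetical curve: well calibrated at .5, overconfident at 1.0.
example = [(0.5, 0.5, 10), (1.0, 0.8, 10)]
print(calibration_score(example))
```

Because each point is weighted by its number of responses, a
heavily used category such as 1.0 contributes more to the score
than a rarely used one.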

Certainty. The most extreme overconfidence has typically been
observed with responses of 1.0, all of which should be associated
with correct answers. The final two columns of Table 1 show that
the sorting procedure did reduce the usage of 1.0 (as was shown by
the third column), which comprised one quarter of the
unconstrained group's responses. However, it did not affect the
correctness of the associated answers. Subjects were still wrong
about 20% of the time when they expressed certainty that they were
right.

Figure 1. Calibration curves for the 3-, 4-, 5-, and 6-pile
groups of Experiment 1, compared with the calibration of
subjects in Figure 9 of Lichtenstein and Fischhoff (1977).

Fischhoff and MacGregor (in press) observed that, in an
unconstrained task, subjects who never used the 1.0 response were
somewhat better calibrated than other subjects. This was not the
case in the present study. The 37 non-users of 1.0 were not
appreciably better calibrated than the 138 users (figure not
shown). Unfortunately for the sake of this comparison, non-users
of 1.0 also had a lower proportion of correct answers than did
users (.566 vs. .603). Because calibration typically deteriorates
as task difficulty increases (Lichtenstein & Fischhoff, 1977),
comparisons are somewhat ambiguous when difficulty varies.

Discussion 

Although the sorting task affected subjects' choice of responses,
it does not seem to have affected the appropriateness of those
responses. Perhaps the only glimmer of an effect is the slight
superiority of the groups using fewer categories. Subjects in the
three-pile and four-pile groups had a bit better overall
calibration than subjects using five or six piles, despite having
a slightly lower percentage of correct answers. Considering the
variety of ways in which the present task differed from its
predecessors, this is a meager haul. Accepting it at face value
would lead one to believe that the appropriateness of people's
confidence cannot be improved by any of the changes from the usual
assessment procedure embodied in the sorting task: focusing
attention on confidence assessment, comparing knowledge levels on
different items, reducing the number of responses used, and
eliminating whatever implicit cues are provided by the usual
response format.

Before accepting this conclusion, we decided to repeat the study
using small group rather than large group administration, with the
experimenter close at hand to answer any questions that arose.
Although such proximity raises slightly the risk of experimenter
interference, it also reduces the risk that subjects deviated from
the prescribed task. Although subjects in Experiment 1 appeared to
work quite hard, the groups were too large to ensure that every
subject performed the task in the desired sequence. Group size may
also have inhibited some subjects from asking clarifying questions
regarding what might have seemed a moderately complicated
procedure.

Experiment  2 

Method 

Experiment 2 repeated the three- and six-pile groups of Experiment
1 in order to see what, if any, effect would be obtained with the
most extreme versions of the sort-and-label manipulations. Instead
of large group administration, groups of about five people were
brought to a small conference room. The experimenter read the
instructions with them, discussed any questions, and remained
during the course of the task. The continual presence of the
experimenter made it possible to ensure that subjects were
following the instructions. The presence of other, hardworking
subjects seemed to encourage them to do so.

Subjects  were  recruited  through  the  local  state  employment 
office.  All  had  at  least  one  year  of  higher  education,  making 
them  generally  comparable  in  educational  background  to  the 
subjects  in  Experiment  1.  Each  individual  was  paid  $8  for 
working  two  hours  on  completing  this  and  a  number  of  subsequent 
unrelated  judgment  tasks.  Most  subjects  completed  this  task 
within  20  minutes,  not  including  the  10-15  minutes  required  for 
the  experimenter  to  read  and  discuss  the  instructions. 

Results 

Response usage. The basic patterns of Experiment 1 were repeated.
Of the 30 six-pile subjects, only 9 did not use the natural
responses (.5, ..., 1.0); of these 9, only three did not use one
of the extreme categories (.5, 1.0). As before, three-pile
subjects made somewhat less use of .5, and considerably less use
of 1.0. They used a wide variety of response sets; even the most
popular (.5, .7, 1.0) was chosen by only 5 people. Details appear
in Table 2.

Performance. The various performance statistics show the sorting
groups as a whole to be quite similar to the unconstrained group.
Though the proportion of correct answers for both sort groups was
slightly superior to that of the unconstrained group, this
difference was also reflected in a somewhat higher level of
confidence for the sort groups. The sorting and unconstrained
groups were comparable on the remaining measures. The one
difference of note that does emerge is between the two sort
groups. The three-pile group was better calibrated and less
overconfident than the six-pile group. This can be seen in the
summary statistics of Table 2 and in the graphic representation of
Figure 2. The six-pile group here actually performed worse than
the unconstrained group, most of whom used six responses
spontaneously.

A  second  modest  effect  is  that  the  22  subjects  (20  from  the 
three-pile  group  and  two  from  the  six-pile  group)  who  did  not  use 
1.0  were  somewhat  better  calibrated  than  the  47  who  did.  Their 
calibration  curves  are  compared  in  Figure  3.  Those  who  used  1.0 
expressed,  on  the  average,  slightly  greater  confidence  in  the 
correctness  of  their  answers  than  those  who  did  not  (.765  vs. 
.750),  but  got  a  smaller  proportion  right  (.619  vs.  .647).  As  a
result,  users  of  1.0  were  more  overconfident  than  non-users  (.146 
vs.  .103). 
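
The overconfidence figures quoted above follow directly from the
definition used in this report, mean confidence minus proportion
correct. The short check below reproduces them from the four
values given in the text; the function name `overconfidence` is an
assumption made for illustration.

```python
def overconfidence(mean_confidence, proportion_correct):
    """Signed difference: positive values indicate overconfidence."""
    return mean_confidence - proportion_correct


# Values reported in the text for Experiment 2.
users_of_certainty = overconfidence(0.765, 0.619)
non_users_of_certainty = overconfidence(0.750, 0.647)

# Round to three places to match the precision of the reported figures.
print(round(users_of_certainty, 3))      # 0.146
print(round(non_users_of_certainty, 3))  # 0.103
```

A negative result from the same function would indicate
underconfidence, the pattern the introduction associates with easy
tasks.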

Discussion 

The overall message of these data is that this rather drastic
change in procedure had little effect on confidence assessment.
The constraints of the procedure did induce sorting subjects to
adopt somewhat different response patterns; however, the
accompanying calibration was indistinguishable from that observed
elsewhere. The only differences of any note are a weak suggestion
that calibration may improve as the number of categories
decreases, and feeble support for the previous observation that
people who do not use 1.0 tend to be better calibrated.

Figure 2. Calibration curves for the 3-pile and 6-pile groups
of Experiment 2, compared with subjects from Lichtenstein and
Fischhoff (1977).

From a practical perspective, these results are disappointing.
Despite a rather concerted effort, we were no more successful than
our predecessors in devising a simple scheme for improving the
quality of confidence assessments. From a theoretical perspective,
however, such negative results are informative and even
encouraging. They point to the robustness of confidence effects
and the generality of previous results.

As noted in the introduction, the sort-and-label procedure
differed from traditional procedures on a number of dimensions.
Had it had an effect, subsequent research would have been directed
to assessing which dimension provided the effective element. Some
of those dimensions are still of interest. For example, what
determines how fine are the discriminations in level of knowledge
that people believe they can make? How do people appraise the
overall difficulty of a set of items and how does that appraisal
affect how people create equivalence classes for feelings of
knowing? Do they first make a crude partition (e.g., don't know,
may know, certain) and then refine it into subsidiary categories,
or do they build categories by matching items for which their
knowledge levels seem equivalent? For the moment, though, the
dominant impression is that confidence is determined by powerful
psychological processes which have resisted the present attempts
to manipulate them, just as they have resisted most previous
efforts.